Topic Modelling on PubMed Abstracts using BERTopic¶

In this notebook, we'll use BERTopic to model topics on the same corpus of PubMed abstracts that we used with modelling with Latent Dirichlet Allocation in the previous exercise (https://github.com/eukairos/topic-models/blob/main/PubMed_LDA_5K.ipynb). Because BERTopic is modular, I could run two models concurrently and compared them too. Along the way, we'll encounter some new concepts as well.

In [1]:
# first check that GPU is available
import torch
torch.cuda.is_available()
Out[1]:
True
In [2]:
# instantiate the embedding model
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer('neuml/pubmedbert-base-embeddings', device='cuda')

Load data¶

import pandas as pd data = pd.read_csv('pubmed_abstracts.csv') to_drop = ['Title','pmid','meshMajor', 'meshid', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z'] data = data.drop(to_drop, axis=1) data = data.sample(n=5000, random_state=42) data = data.reset_index(drop=True) data['abstractText'] = data['abstractText'].str.lower()

do a bit of cleaning by removing numbers¶

import re def remove_numbers(series): def rem_no(text): pattern = r'\b\d+(.\d{1,2})?\b' cleaned_text = re.sub(pattern, '', text) cleaned_text = cleaned_text.strip() return cleaned_text return series.apply(rem_no)

data['no_numbers'] = remove_numbers(data['abstractText']) abstracts = data['no_numbers'].to_list()

In [4]:
# generate the embeddings
embeddings = embed_model.encode(abstracts)

The best combination of parameters were discovered in a separate session (Refer to https://github.com/eukairos/topic-models/blob/main/Best%20UMAP%20n%20HDBSCAN%20hyperparameters%20for%20PubMed.ipynb) and are:
UMAP: n_components = 20, n_neighbors = 10
HDBSCAN: min_cluster_size = 25, min_sample = 10
We'll fit these into our BERTopic pipeline.

Dimensionality Reduction¶

In [5]:
from bertopic import BERTopic
from umap import UMAP

reducer = UMAP(
        n_components=20,
        n_neighbors=10,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )

reduced = reducer.fit_transform(embeddings)

Clustering¶

What actually is clustering in BERTopic's context? Essentially, similar documents (or more specifically, their embeddings) are merged together into a superdocument (a 'document cluster'). Thereafter, the BERTopic algorithm only deals with the superdocument clusters.

In [6]:
from hdbscan import HDBSCAN

clusterer = HDBSCAN(
        min_cluster_size=25,
        min_samples=5,
        cluster_selection_method='eom',
        metric="euclidean",
        gen_min_span_tree=True,
    ).fit(reduced)

Representation Model¶

By 'representation' BERTopic means representation of what topics are. The default process is:

  1. Create a Bag-of-Words for each superdocument using CountVectorizer,
  2. Computes term frequency-inverse document frequency on the superdocument (hence the 'cluster-Tfidf' name). That is, it calculates the importance of words in the superdocument relative to other clusters, thereby identifying the most representative keywords for each topic.
    BERTopic's documentation also suggests delaying the CountVectorizer step to after training the model. The advantage of doing so is that we can further fine-tune the representation by using .update_topics(), thereby allowing us to tweak topic representations without having to re-train our models. In fact, that is what we have done in this exercise. Tne cell below is the instantiation of the representation models.
In [7]:
# Instantiate the BERTopic models.

from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
import openai

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)

# Model 1: Default (ctfidf only)
topic_model = BERTopic(
    embedding_model=embed_model,
    umap_model = reducer,
    hdbscan_model = clusterer,
    ctfidf_model = ctfidf_model)

# Model 2: KeyBERT Representation
representation_model_keybert = KeyBERTInspired()
topic_model_keybert = BERTopic(
    embedding_model = embed_model,
    umap_model = reducer,
    hdbscan_model = clusterer,
    ctfidf_model = ctfidf_model,
    representation_model = representation_model_keybert
    )

While the default representation model is based on clustered Tfidf (ctfidf), KeyBERTInspired is an alternative representation model that selects topic keywords based on semantic similarity rather than statistical frequency. It takes all candidate keywords from documents in a topic, computes embeddings, then computes a centroid embedding for the topic (average of all document embeddings in that topic), and reranks keywords by cosine similarity between their embeddings and the topic centroid. In this way, it captures semantic coherence, and handles synonyms better. It is also less sensitive to document length - ctfidf is, like tfidf, is sensitive to document length.

Other representation models include MaximalMarginalRelevance, which diversifies keywords in a topic, so that the same word won't appear in different topics. You may recall from our LDA exercise that the word "cell" keeps popping up in different topics.

It is also possible to chain more than one representation model together. You can read about it in the BERTopic documentation (https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#chain-models).

In [8]:
# fit the models on the embeddings
topics_1, probs_1 = topic_model.fit_transform(abstracts, embeddings)
topics_2, probs_2 = topic_model_keybert.fit_transform(abstracts, embeddings)
In [9]:
# check outputs of models to see if they make sense
topic_model.get_topic_info()[1:11]
Out[9]:
Topic Count Name Representation Representative_Docs
1 0 214 0_isolates_vaccine_virus_cattle [isolates, vaccine, virus, cattle, strains, in... [chronic wasting disease (cwd) is a fatal prio...
2 1 132 1_visual_task_movement_walking [visual, task, movement, walking, stimulus, sp... [objectives: noise often has detrimental effec...
3 2 121 2_images_image_imaging_measurements [images, image, imaging, measurements, phantom... [background: endovascular aortic procedures ha...
4 3 116 3_soil_water_wastewater_concentrations [soil, water, wastewater, concentrations, orga... [quinclorac, a highly selective auxin herbicid...
5 4 115 4_gastric_pylori_laparoscopic_fecal [gastric, pylori, laparoscopic, fecal, cholecy... [the surgical standard for ulcerative colitis ...
6 5 113 5_strains_oil_enzyme_cga [strains, oil, enzyme, cga, extract, lipase, p... [oleaginous fungi are of special interest amon...
7 6 108 6_recurrence_lymph_node_pet [recurrence, lymph, node, pet, nodes, survival... [background: no agreement has been made about ...
8 7 106 7_nk_cd4_leukemia_il [nk, cd4, leukemia, il, aml, cells, ifn, cd8, ... [neoplastic disorders sometimes accompany a re...
9 8 105 8_ventricular_aortic_myocardial_heart [ventricular, aortic, myocardial, heart, cardi... [background: we studied the effects of diabete...
10 9 97 9_eyes_lens_cataract_corneal [eyes, lens, cataract, corneal, glaucoma, intr... [aims: to quantify the rates of eye preservati...
In [10]:
topic_model_keybert.get_topic_info()[1:11]
Out[10]:
Topic Count Name Representation Representative_Docs
1 0 214 0_sera_sj26_antibody_immunization [sera, sj26, antibody, immunization, rotavirus... [chronic wasting disease (cwd) is a fatal prio...
2 1 132 1_information_ankle_verbal_feedback [information, ankle, verbal, feedback, observe... [objectives: noise often has detrimental effec...
3 2 121 2_reconstruction_cm2_breast_lesions [reconstruction, cm2, breast, lesions, reconst... [background: endovascular aortic procedures ha...
4 3 116 3_water_lafeo3_groundwater_efficiency [water, lafeo3, groundwater, efficiency, matte... [quinclorac, a highly selective auxin herbicid...
5 4 115 4_cholecystitis_appendicitis_esophageal_laparo... [cholecystitis, appendicitis, esophageal, lapa... [the surgical standard for ulcerative colitis ...
6 5 113 5_yeast_indol_enzymes_amino [yeast, indol, enzymes, amino, enzyme, acrylam... [oleaginous fungi are of special interest amon...
7 6 108 6_prognosis_prognostic_toxicity_follow [prognosis, prognostic, toxicity, follow, rare... [background: no agreement has been made about ...
8 7 106 7_amh_mtb_tuberculosis_leflunomide [amh, mtb, tuberculosis, leflunomide, cgat, gl... [neoplastic disorders sometimes accompany a re...
9 8 105 8_aneurysm_aortic_echocardiography_aaa [aneurysm, aortic, echocardiography, aaa, metf... [background: we studied the effects of diabete...
10 9 97 9_conjunctival_glaucoma_rnn_glaucomatous [conjunctival, glaucoma, rnn, glaucomatous, uv... [aims: to quantify the rates of eye preservati...

Updating Topics¶

BERTopic suggests removing stopwords at this step, leveraging on sklearn's CountVectorizer's inbuilt stopwords attribute. As in our previous exercise, we also add our custom PubMed stopwords to the native English stopwords.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
pubmed_stopwords = pd.read_csv('pubmed_stopwords.csv')
combined_stopwords = list(ENGLISH_STOP_WORDS.union(pubmed_stopwords))
vectorizer_model = CountVectorizer(
    stop_words = combined_stopwords,
    ngram_range = (1,3),
    min_df = 10)
In [12]:
topic_model.update_topics(abstracts, vectorizer_model=vectorizer_model)

topic_model_keybert.update_topics(abstracts, vectorizer_model=vectorizer_model, 
                            representation_model=representation_model_keybert)

Let's see whether our topics have changed after updating

In [13]:
topic_model.get_topic_info()[1:11]
Out[13]:
Topic Count Name Representation Representative_Docs
1 0 214 0_isolates_virus_strains_infection [isolates, virus, strains, infection, samples,... [chronic wasting disease (cwd) is a fatal prio...
2 1 132 1_visual_children_task_movement [visual, children, task, movement, memory, sti... [objectives: noise often has detrimental effec...
3 2 121 2_images_imaging_image_measurements [images, imaging, image, measurements, method,... [background: endovascular aortic procedures ha...
4 3 116 3_water_concentrations_organic_concentration [water, concentrations, organic, concentration... [quinclorac, a highly selective auxin herbicid...
5 4 115 4_gastric_patients_surgical_surgery [gastric, patients, surgical, surgery, underwe... [the surgical standard for ulcerative colitis ...
6 5 113 5_activity_strains_acid_enzyme [activity, strains, acid, enzyme, ph, compound... [oleaginous fungi are of special interest amon...
7 6 108 6_recurrence_patients_survival_tumor [recurrence, patients, survival, tumor, node, ... [background: no agreement has been made about ...
8 7 106 7_cells_il_cell_cd4 [cells, il, cell, cd4, anti, patients, antigen... [neoplastic disorders sometimes accompany a re...
9 8 105 8_ventricular_aortic_patients_heart [ventricular, aortic, patients, heart, cardiac... [background: we studied the effects of diabete...
10 9 97 9_visual_eye_surgery_disc [visual, eye, surgery, disc, vision, patients,... [aims: to quantify the rates of eye preservati...
In [15]:
topic_model_keybert.get_topic_info()[1:11]
Out[15]:
Topic Count Name Representation Representative_Docs
1 0 214 0_isolates_isolate_pigs_virus [isolates, isolate, pigs, virus, strain, patho... [chronic wasting disease (cwd) is a fatal prio...
2 1 132 1_processing_task_memory_performance [processing, task, memory, performance, discri... [objectives: noise often has detrimental effec...
3 2 121 2_ultrasound_imaging_image_real time [ultrasound, imaging, image, real time, images... [background: endovascular aortic procedures ha...
4 3 116 3_water_environmental_removal_environment [water, environmental, removal, environment, a... [quinclorac, a highly selective auxin herbicid...
5 4 115 4_gastric_resection_surgery_surgical [gastric, resection, surgery, surgical, operat... [the surgical standard for ulcerative colitis ...
6 5 113 5_yeast_enzymes_enzyme_fungi [yeast, enzymes, enzyme, fungi, fungal, substr... [oleaginous fungi are of special interest amon...
7 6 108 6_breast cancer_recurrence_resection_prognostic [breast cancer, recurrence, resection, prognos... [background: no agreement has been made about ...
8 7 106 7_lymphocytes_lymphocyte_cd4_cytokine [lymphocytes, lymphocyte, cd4, cytokine, cytok... [neoplastic disorders sometimes accompany a re...
9 8 105 8_left ventricular_heart failure_heart disease... [left ventricular, heart failure, heart diseas... [background: we studied the effects of diabete...
10 9 97 9_eye_vision_visual_laser [eye, vision, visual, laser, anterior, aqueous... [aims: to quantify the rates of eye preservati...

Evaluation¶

As in the previous exercise, we use coherence to evaluate our models. We used UMass in that exercise. Here we introduce two other coherence metrics, C_V and NPMI. C_V and NPMI, like UMass, calculate co-occurrence (if two words belong to the same topic, they should appear together often) but they use different math to calculate it. We used UMass in our last exercise for convenience because it is fast, and does not need external references. C_V is computationally expensive, but is usually regarded as the 'best'.

In [17]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim import corpora

def get_topic_words(topic_model, top_n=20):
    """Extract top-n words per topic from a BERTopic model, excluding outlier topic -1."""
    topic_words = []
    for topic_id in topic_model.get_topic_freq()["Topic"]:
        if topic_id == -1:
            continue
        words = [w for w, _ in topic_model.get_topic(topic_id)[:top_n]]
        if words:
            topic_words.append(words)
    return topic_words

def tokenize_docs(docs):
    """Simple whitespace tokenizer — replace with your preprocessing if needed."""
    return [doc.lower().split() for doc in docs]

def topic_diversity(topic_words, top_n=20):
    """
    Proportion of unique words across all topic top-word lists.
    Score of 1.0 = all words unique across topics (maximum diversity).
    Score approaching 0 = heavy repetition of the same words across topics.
    """
    all_words = [w for words in topic_words for w in words[:top_n]]
    if not all_words:
        return 0.0
    return round(len(set(all_words)) / len(all_words), 4)

def compute_coherence(topic_words, tokenized_docs, coherence="c_v"):
    """Compute coherence score for a list of topic word lists."""
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    cm = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        corpus=corpus,
        coherence=coherence,
    )
    return cm.get_coherence(), cm.get_coherence_per_topic()

def evaluate_model(name, topic_model, docs, top_n=20):
    """Run full evaluation for a single BERTopic model."""
    topic_words = get_topic_words(topic_model, top_n=top_n)
    tokenized  = tokenize_docs(docs)

    cv_score,    cv_per_topic    = compute_coherence(topic_words, tokenized, "c_v")
    umass_score, umass_per_topic = compute_coherence(topic_words, tokenized, "u_mass")
    npmi_score,  npmi_per_topic  = compute_coherence(topic_words, tokenized, "c_npmi")
    diversity                    = topic_diversity(topic_words, top_n=top_n)

    topic_freq = topic_model.get_topic_freq()
    n_topics   = len(topic_freq[topic_freq["Topic"] != -1])
    noise_docs  = topic_freq[topic_freq["Topic"] == -1]["Count"].values
    noise_pct   = 100 * noise_docs[0] / len(docs) if len(noise_docs) > 0 else 0.0

    return {
        "model":       name,
        "n_topics":    n_topics,
        "noise_pct":   round(noise_pct, 2),
        "c_v":         round(cv_score, 4),
        "c_npmi":      round(npmi_score, 4),
        "c_umass":     round(umass_score, 4),
        "diversity":  diversity,
        # Per-topic lists for deeper inspection
        "_cv_per_topic":    cv_per_topic,
        "_npmi_per_topic":  npmi_per_topic,
        "_topic_words":     topic_words,
    }
In [18]:
models = {
    "Default":       topic_model,
    "KeyBERT":       topic_model_keybert,
   }

records = []
per_topic_data = {}

for name, model in models.items():
    print(f"Evaluating {name}...")
    result = evaluate_model(name, model, abstracts, top_n=20)
    per_topic_data[name] = {
        "topic_words":    result.pop("_topic_words"),
        "cv_per_topic":   result.pop("_cv_per_topic"),
        "npmi_per_topic": result.pop("_npmi_per_topic"),
    }
    records.append(result)
Evaluating Default...
Evaluating KeyBERT...
In [19]:
df = pd.DataFrame(records).set_index("model")
print("\n── Evaluation Summary ───────────────────────────────────────────────")
print(df.to_string())
── Evaluation Summary ───────────────────────────────────────────────
         n_topics  noise_pct     c_v  c_npmi  c_umass  diversity
model                                                           
Default        61      28.26  0.5226 -0.0552  -4.5024     0.7377
KeyBERT        61      28.26  0.5186 -0.1052  -5.2371     0.8000

For C_V, the closer the number to 1, the better the coherence. For both NPMI and UMass, the closer the number to zero, the better the coherence.

The Default model has better coherence but has less topic diversity. Let's choose it as our model.
First, let's use an LLM to convert the default topic labels to something more intelligible. I'm using MedGemma through Ollama here

In [24]:
import ollama, json

SYSTEM_PROMPT =  '''You are a biomedical expert. Given keywords from a BERTopic topic model trained on PubMed abstracts, 
return a JSON object with the key "topic_label" containing a topic label of NO MORE THAN FIVE WORDS.
Example output: {"topic_label": "Cancer Cell Biology"}'''

def label_topic(topic_id, keywords):
    kw_str = ', '.join(keywords)
    response = ollama.chat(
        model = 'MedAIBase/MedGemma1.5:4b',
        format = 'json',
        options = {'temperature': 0},
        messages = [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': f'Keywords: {kw_str}'}])
    result = json.loads(response['message']['content'])
    label = result['topic_label'].strip()
    words = label.split()
    if len(words) > 5:
        label = ' '.join(words[:5])
    return label  
In [25]:
# Extract keywords from BERTopic and label each topic
topic_labels = {}

for topic_id in topic_model.get_topic_freq()["Topic"]:
    if topic_id == -1:
        continue
    keywords = [word for word, _ in topic_model.get_topic(topic_id)[:20]]
    label = label_topic(topic_id, keywords)
    topic_labels[topic_id] = label
    print(f"Topic {topic_id:>3d} : {label}")

# Apply labels to the model
topic_model.set_topic_labels(topic_labels)
Topic   0 : Viral Infection
Topic   1 : Visual Perception in Children
Topic   2 : Medical Imaging
Topic   3 : Water Treatment
Topic   4 : Surgical Complications in Gastric Patients
Topic   5 : Plant Oil Production
Topic   6 : Breast Cancer Survival
Topic   7 : Immune Cell Activation
Topic   8 : Heart Disease
Topic   9 : Visual Surgery Outcomes
Topic  10 : Medical Education
Topic  11 : Mental Health Assessment
Topic  12 : Ovarian Hormone Levels in Pregnancy
Topic  13 : Protein Structure and Function
Topic  14 : Genetic Syndromes
Topic  15 : Hemorrhage Diagnosis
Topic  16 : HIV Drug Resistance
Topic  17 : Health Services Research
Topic  18 : Dental Implant Materials
Topic  19 : Protein Biochemistry
Topic  20 : Lung Function and Oxygenation
Topic  21 : Sexual Health Education
Topic  22 : Kidney Function and Hypertension
Topic  23 : Breast Cancer Screening
Topic  24 : Knee Joint Reconstruction
Topic  25 : Plant Genetics
Topic  26 : Protein Aggregation
Topic  27 : Dietary Fat Intake
Topic  28 : Nitric Oxide Signaling
Topic  29 : Inflammation in Airway Cells
Topic  30 : Liver Metabolism
Topic  31 : Mass Spectrometry Techniques
Topic  32 : Drug Delivery Systems
Topic  33 : Lipid Metabolism in Mice
Topic  34 : Drug Effects on Neurons
Topic  35 : Coronary Artery Disease
Topic  36 : Pain Management Procedures
Topic  37 : Maternal and Infant Health
Topic  38 : Ion Channel Regulation
Topic  39 : Renal Function and Imaging
Topic  40 : Fungal Pulmonary Disease
Topic  41 : Hepatitis B Serology
Topic  42 : Cell Migration and Matrix Interactions
Topic  43 : Breast Cancer Prognosis
Topic  44 : Synaptic Potentials
Topic  45 : Genetic Susceptibility
Topic  46 : Plant Growth and Physiology
Topic  47 : Obesity and Insulin Resistance
Topic  48 : Mitochondrial Evolution
Topic  49 : Chemotherapy Side Effects
Topic  50 : Spinal Cord Injury
Topic  51 : Neurotransmitter Receptor Binding
Topic  52 : Pancreatic Cancer Cell Biology
Topic  53 : Diabetes Mellitus Research
Topic  54 : Exercise Physiology
Topic  55 : Cellular Processes
Topic  56 : Liver Disease Severity
Topic  57 : Catalysis and Synthesis
Topic  58 : Trauma and Emergency Care
Topic  59 : Wound Healing and Tissue Regeneration
Topic  60 : Zinc Metal Complex

Visualization¶

In [28]:
# Visualizing Topics
import plotly.io as pio
pio.renderers.default='notebook'
topic_model.visualize_topics()
In [29]:
# visualizing documents
# reduce dimensionality of embeddings first, speeds up visualization
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

topic_model.visualize_documents(
    abstracts,
    reduced_embeddings=reduced_embeddings,
    topics=list(range(20)),
    custom_labels=True)
In [30]:
# Visualizing clustering
topic_model.visualize_hierarchy(custom_labels=True)
In [31]:
# Topic similarity heatmap
topic_model.visualize_heatmap(top_n_topics=20, custom_labels=True)
In [32]:
# Top-n words associated with a topic
topic_model.visualize_barchart(
    n_words = 10,
    custom_labels = True,
    height = 500
)

Extracting Information¶

In [34]:
# find similar topics
similar_topics, similarity = topic_model.find_topics('diabetes', top_n=5)
output = topic_model.get_topic(similar_topics[0])
print(output)
[('insulin', np.float64(0.07505057041840049)), ('glucose', np.float64(0.06449658395331292)), ('diabetic', np.float64(0.05262164962008938)), ('rats', np.float64(0.046001592272389244)), ('beta', np.float64(0.03388939750848574)), ('pancreas', np.float64(0.02949886994259258)), ('secretion', np.float64(0.024190927374734113)), ('diabetes', np.float64(0.02414189329120224)), ('mice', np.float64(0.023805660928468596)), ('pancreatic', np.float64(0.022874374276756034))]
In [35]:
# find all representations of a topic
output1 = topic_model.get_topic(15, full=True)
print(output1)
{'Main': [('patients', np.float64(0.027917486608856703)), ('diagnosis', np.float64(0.026417612300799875)), ('bleeding', np.float64(0.024901275791185475)), ('therapy', np.float64(0.015319996802860718)), ('hemorrhage', np.float64(0.015217886572923122)), ('cranial', np.float64(0.014582847977053592)), ('ci', np.float64(0.013687877858091198)), ('cases', np.float64(0.01330872942611377)), ('risk', np.float64(0.01309258588671222)), ('case', np.float64(0.012850121239623225))]}
In [37]:
# find distributions of topics in documents
# first, get the distributions for all documents in the corpus
topic_distr, _ = topic_model.approximate_distribution(abstracts, window=8, stride=4, use_embedding_model=True)
print(topic_distr)
Batches:   0%|          | 0/1427 [00:00<?, ?it/s]
Batches:   0%|          | 0/1426 [00:00<?, ?it/s]
Batches:   0%|          | 0/1388 [00:00<?, ?it/s]
Batches:   0%|          | 0/1422 [00:00<?, ?it/s]
Batches:   0%|          | 0/1411 [00:00<?, ?it/s]
[[0.         0.00631404 0.         ... 0.2550849  0.         0.        ]
 [0.         0.04006515 0.03722432 ... 0.05020912 0.         0.        ]
 [0.         0.         0.00963105 ... 0.01278466 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.00644452 0.         0.        ]
 [0.         0.00488603 0.02214933 ... 0.08309011 0.         0.        ]
 [0.01363206 0.00864306 0.00398407 ... 0.0155118  0.00516646 0.        ]]
In [44]:
# then select a document to visualize
import plotly.graph_objects as go

# make sure to input probabilities for a single document.
output2 = topic_model.visualize_distribution(topic_distr[0], custom_labels=True)
fig = go.Figure(output2)
fig.show()
In [45]:
# for reference here is the corresponding abstract
abstracts[0]
Out[45]:
'we reviewed the patterns of injuries sustained by  consecutive fallers and jumpers in whom primary impact was onto the feet. the fall heights ranged from  to  ft. the  patients sustained  significant injuries. skeletal injuries were most frequent and included  lower extremity fractures, four pelvic fractures, and nine spinal fractures. in two patients, paraplegia resulted. genitourinary tract injuries included bladder hematoma, renal artery transection, and renal contusion. thoracic injuries included rib fractures, pneumothorax, and hemothorax. secondary impact resulted in several craniofacial and upper extremity injuries. chronic neurologic disability and prolonged morbidity were common. one patient died; the patient who fell  ft survived. after initial stabilization, survival is possible after falls or jumps from heights as great as  feet it is important to recognize the skeletal and internal organs at risk from high-magnitude vertical forces.'
In [ ]: